Hypothesis
A machine learning model essentially learns a conditional distribution; models differ in how they construct the conditioning information.
Linear network model
\[ \begin{bmatrix}x_{11}&x_{12}&x_{13}&\dots&x_{1m}\\x_{21}&x_{22}&x_{23}&\dots&x_{2m}\\\vdots&\vdots&\vdots&\ddots&\vdots\\x_{n1}&x_{n2}&x_{n3}&\dots&x_{nm}\end{bmatrix} \begin{bmatrix}w_{11}&w_{12}&\dots&w_{1h}\\w_{21}&w_{22}&\dots&w_{2h}\\w_{31}&w_{32}&\dots&w_{3h}\\\vdots&\vdots&\ddots&\vdots\\w_{m1}&w_{m2}&\dots&w_{mh}\end{bmatrix} \]
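A quick shape check (a sketch with arbitrary values of n, m, and h) confirms that the product of an n×m data matrix and an m×h weight matrix is n×h, one row per sample and one column per output unit:

```python
import numpy as np

n, m, h = 4, 3, 2            # arbitrary sample count, feature count, hidden width
X = np.random.randn(n, m)    # data matrix: one sample per row
W = np.random.randn(m, h)    # weight matrix: one column per output unit
O = X @ W                    # output matrix
print(O.shape)               # (4, 2)
```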
Mandatory requirements
We hope that the training data matrix and the validation data matrix both represent the same distribution of \(\vec{X}_{1\times m}\).
Therefore, the samples in each data matrix should be independent.
If samples within the training or validation data matrix are dependent, predictions on the validation data matrix may be unreliable.
If the training data matrix and the validation data matrix are dependent on each other, predictions on the test data matrix may be unreliable.
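A small sketch of why dependence matters (hypothetical data, not from the experiment below): white noise gives independent samples, while a random walk makes consecutive samples nearly identical, so a random train/validation split would scatter near-duplicates across both sets and inflate validation scores:

```python
import numpy as np

rng = np.random.default_rng(0)
independent = rng.normal(size=10000)   # white noise: samples are independent
dependent = np.cumsum(independent)     # random walk: each sample carries over the last

def lag1_corr(x):
    """Correlation between each sample and its predecessor."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_corr(independent))  # near 0
print(lag1_corr(dependent))    # near 1: shuffling leaks near-duplicates across splits
```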
Linear networks can only learn a linear relationship from the predictors to the response variable.
The best achievable model performance is determined only by the raw data, the new information it contains, and the linear model class itself.
The plot below illustrates the ideal model's performance under different situations.
import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots
x = np.linspace(0, 5, 1000)
y1 = 1 - np.exp(-3 *x)
y2 = (1 - np.exp(-2 * x)) * 0.8
y3 = (1 - np.exp(-1 * x)) * 0.5
fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Scatter(x=x, y=y1, mode='markers', name='line1', marker=dict(color='RoyalBlue')))
fig.add_trace(go.Scatter(x=x, y=y2, mode='markers', name='line2', marker=dict(color='red')))
fig.add_trace(go.Scatter(x=x, y=y3, mode='markers', name='line3', marker=dict(color='green')))
fig.add_shape(type="line",
x0=0, y0=1, x1=1, y1=1,
xref='paper', yref='y',
line=dict(color='RoyalBlue', width=3, dash='dash'))
fig.add_shape(type="line",
x0=0, y0=0.8, x1=1, y1=0.8,
xref='paper', yref='y',
line=dict(color='red', width=3, dash='dash'))
fig.add_shape(type="line",
x0=0, y0=0.5, x1=1, y1=0.5,
xref='paper', yref='y',
line=dict(color='green', width=3, dash='dash'))
fig.update_layout(
title="Model's Performance over Different Situations",
xaxis_title="Time Cost to Optimal State",
yaxis_title="Model's Performance"
)
fig.show()
Non-mandatory requirements
Whether the attributes are dependent or independent, the best achievable model performance is not affected.
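This can be checked with a small sketch (hypothetical data and coefficients): appending a linearly dependent column to the design matrix leaves the best achievable squared error of a linear model unchanged, because the redundant column adds no new information.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 * x1 - 3 * x2 + rng.normal(scale=0.1, size=n)

X_indep = np.column_stack([x1, x2])
X_dep = np.column_stack([x1, x2, x1 + x2])  # third column is linearly dependent

def best_mse(X, y):
    """Best achievable mean squared error of a linear model, via least squares."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ w - y) ** 2)

print(best_mse(X_indep, y))  # roughly the noise variance
print(best_mse(X_dep, y))    # same value: the redundant column changes nothing
```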
One-layer linear network
\[ \begin{bmatrix}x_{11}&x_{12}&x_{13}&\dots&x_{1m}\\x_{21}&x_{22}&x_{23}&\dots&x_{2m}\\\vdots&\vdots&\vdots&\ddots&\vdots\\x_{n1}&x_{n2}&x_{n3}&\dots&x_{nm}\end{bmatrix} \begin{bmatrix}w_{11}\\w_{21}\\w_{31}\\\vdots\\w_{m1}\end{bmatrix} = \begin{bmatrix}o_{11}\\o_{21}\\\vdots\\o_{n1}\end{bmatrix} \]
- If the scales of different attributes vary significantly, the steepness of the parameter space along different directions will be inconsistent, which may cause the loss to oscillate or even increase during training. However, this problem seems to be mitigated in multi-layer networks.
Multi-layer linear network
Code
Below is a complete piece of code that can be modified and run directly to experiment with MLPs.
MyDataset
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from sklearn.model_selection import train_test_split
class MyDataset(Dataset):
def __init__(self, input_data, input_label, features_transform=None, labels_transform=None):
self.input_data = input_data
self.input_label = input_label
self.features_transform = features_transform
self.labels_transform = labels_transform
def __len__(self):
return len(self.input_data)
def __getitem__(self, idx):
feature = self.input_data[idx]
if self.features_transform:
feature = self.features_transform(feature)
label = self.input_label[idx]
if self.labels_transform:
label = self.labels_transform(label)
return feature, label
device = "cuda" if torch.cuda.is_available() else "cpu"
Complete code
column1 = np.random.normal(5, 2, size=(10000,)).astype(np.float32)
column2 = np.random.normal(5, 2, size=(10000,)).astype(np.float32)
column3 = np.random.normal(10, 3, size=(10000,)).astype(np.float32)
column4 = np.random.normal(7, 10, size=(10000,)).astype(np.float32)
column5 = np.random.normal(1, 10, size=(10000,)).astype(np.float32)
column6 = np.random.normal(2, 1, size=(10000,)).astype(np.float32)
column7 = np.random.normal(100, 100, size=(10000,)).astype(np.float32)
temp = np.diff(column1)
column1_2 = np.append(temp, temp[-1])
column1_1 = (column1 + 1) * 1.5
column2_1 = column2 ** 2
column_ = 2 * column1_2 - 3 * column2
df = pd.DataFrame({'column1': column1, 'column2': column2, 'column3': column3, 'column4': column4, 'column5': column5, 'column6': column6, 'column7': column7,
'column1_1': column1_1, 'column2_1': column2_1, 'column1_2': column1_2})
myfeatures = torch.tensor(df.loc[:, ['column1_2', 'column2']].values).float()
mylabels = torch.tensor(column_).float()
train_features, val_features, train_labels, val_labels = train_test_split(myfeatures, mylabels, test_size=0.2, random_state=42)
train_dataset = MyDataset(train_features, train_labels, None, None)
val_dataset = MyDataset(val_features, val_labels, None, None)
model = nn.Sequential(nn.Linear(train_features.shape[1], 4, bias=False),
nn.ReLU(),
nn.Linear(4, 4, bias=False),
nn.ReLU(),
nn.Linear(4, 1, bias=False))
# model = nn.Sequential(nn.Linear(train_features.shape[1], 1, bias=False))
# def init_weights(m):
# if type(m) == nn.Linear:
# nn.init.xavier_uniform_(m.weight)
# # with torch.no_grad():
# # m.weight = nn.Parameter(torch.tensor([[0., 0.]]))
# model.apply(init_weights)
num_epochs = 100
batch_size = 256
lr = 0.001
model.to(device)
criterion = nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
myloss = []
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
for epoch in range(num_epochs):
train_loss = 0
model.train()
for batch_features, batch_labels in train_loader:
optimizer.zero_grad()
train_outputs = model(batch_features.to(device))
loss = criterion(train_outputs, batch_labels.reshape(-1, 1).to(device))
train_loss += loss.item()
loss.backward()
optimizer.step()
train_loss /= len(train_loader)
if epoch % 10 == 0:
val_loss = 0
model.eval()
with torch.no_grad():
for batch_features, batch_labels in val_loader:
val_outputs = model(batch_features.to(device))
loss = criterion(val_outputs, batch_labels.reshape(-1, 1).to(device))
val_loss += loss.item()
val_loss /= len(val_loader)
print('epoch {}/{} train loss: {:.2f}, val loss: {:.2f}'.format(epoch, num_epochs, train_loss, val_loss))
The average loss computed in the code above is approximate, because the last batch is usually smaller than batch_size, so averaging the per-batch means weights its samples more heavily than the rest.
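One way to make the average exact is to accumulate the summed loss per batch and divide by the total sample count at the end. A minimal sketch with two hand-made batches of sizes 3 and 1 (hypothetical values, not from the experiment above):

```python
import torch
from torch import nn

criterion_sum = nn.MSELoss(reduction='sum')  # sum per batch, not mean

preds = [torch.tensor([1.0, 2.0, 3.0]), torch.tensor([4.0])]    # batches of 3 and 1
targets = [torch.tensor([1.0, 2.0, 4.0]), torch.tensor([6.0])]

total_loss, total_n = 0.0, 0
for p, t in zip(preds, targets):
    total_loss += criterion_sum(p, t).item()  # squared errors summed over the batch
    total_n += t.numel()

exact_mean = total_loss / total_n  # (0 + 0 + 1 + 4) / 4 = 1.25
print(exact_mean)
```

Averaging the per-batch means instead would give (1/3 + 4) / 2 ≈ 2.17 here, because the single-sample batch is weighted as much as the three-sample batch.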